feat(bench): live benchmark run — full schemas + Recall@k + latency (MCP-42a)#748
…MCP-42a)

Extends the bench/ harness (PR #747) with a live run against a running proxy:

- Exact token number: GET /api/v1/tools pulls upstream tools WITH full JSON input schemas; proxy-mode tools carry their live schemas via the extended server.ProxyModeToolDefs (BenchProxyToolDef.Schema). Schemas are counted on BOTH sides so the headline savings is authoritative; it is withheld (authoritative_headline=false) if any proxy tool lacks a schema, which is the MCP-3161 overstatement guard.
- Accuracy: replays the Spec 065 retrieval golden set through the proxy BM25 search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}/MRR/nDCG@10/MAP against graded labels (deterministic, no LLM). Field names mirror Spec 065 score-report.schema.json.
- Latency: client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot load-all-tools cost (server "took" is a 0ms stub).

CLI: `go run ./bench/cmd/bench -live -proxy URL -api-key KEY`. Reports stay gitignored (CN-003).

All metric math + the live client are unit-tested with httptest stubs; the docker-compose substrate is the live-reproduction path.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
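To make the accuracy metrics concrete, here is a minimal sketch of Recall@k and the per-query reciprocal rank (MRR is its mean over all golden-set queries). It is not the harness code: the function names, the tool names, and the rule that any grade above zero counts as relevant are assumptions for illustration, and the ranked list is assumed to contain no duplicates.

```go
package main

import "fmt"

// recallAtK returns the fraction of relevant tools found in the top k results.
// relevant maps a tool name to its graded label; a grade > 0 counts as relevant
// (an assumption of this sketch, not a rule taken from the golden set).
func recallAtK(ranked []string, relevant map[string]int, k int) float64 {
	total := 0
	for _, grade := range relevant {
		if grade > 0 {
			total++
		}
	}
	if total == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	for _, name := range ranked[:k] {
		if relevant[name] > 0 {
			hits++
		}
	}
	return float64(hits) / float64(total)
}

// reciprocalRank returns 1/rank of the first relevant result, or 0 if none.
func reciprocalRank(ranked []string, relevant map[string]int) float64 {
	for i, name := range ranked {
		if relevant[name] > 0 {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	// Hypothetical search result and graded labels for one query.
	ranked := []string{"github:list_issues", "github:create_issue", "jira:search"}
	relevant := map[string]int{"github:create_issue": 2, "jira:search": 1}

	for _, k := range []int{1, 3} {
		fmt.Printf("Recall@%d = %.2f\n", k, recallAtK(ranked, relevant, k))
	}
	fmt.Printf("RR = %.2f\n", reciprocalRank(ranked, relevant))
}
```

Because both functions are pure, they can be checked against hand-computed examples without a running proxy.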
Deploying mcpproxy-docs

| Latest commit: | 0602786 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://2112823c.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://feat-mcp-42a-live-bench.mcpproxy-docs.pages.dev |
📦 Build Artifacts

How to Download
- Option 1: GitHub Web UI (easiest)
- Option 2: GitHub CLI: `gh run download 27976001588 --repo smart-mcp-proxy/mcpproxy-go`
ConvertGenericToolsToTyped read generic["schema"], but every producer of the generic tool map (runtime/server GetServerTools, mcp.go) emits the upstream input schema under "inputSchema". The /api/v1/tools response therefore dropped every schema, so the MCP-42a live benchmark baseline was silently a description-only token count instead of the required full-schema count, while still able to emit authoritative_headline=true.

- Read "inputSchema" first in the converter, keep "schema" as a legacy fallback.
- Gate the live headline on baseline schemas too (BaselineSchemasCounted via anyHaveSchema): a systematically schema-less baseline now withholds the headline instead of claiming a full-schema baseline it never had.
- Tests: converter preserves inputSchema (+legacy schema fallback); headline withheld when the baseline carries no schemas.

Related #748
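The fix described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual converter: `typedTool`, `schemaOf`, `convert`, `allHaveSchema`, and `authoritativeHeadline` are hypothetical names, and `anyHaveSchema` only borrows its name from the commit message (its real signature is not shown in this PR).

```go
package main

import "fmt"

// typedTool is a hypothetical stand-in for the typed tool record.
type typedTool struct {
	Name   string
	Schema map[string]any
}

// schemaOf prefers the "inputSchema" key and falls back to legacy "schema".
func schemaOf(generic map[string]any) map[string]any {
	for _, key := range []string{"inputSchema", "schema"} {
		if s, ok := generic[key].(map[string]any); ok && len(s) > 0 {
			return s
		}
	}
	return nil
}

// convert turns generic tool maps into typed tools, keeping their schemas.
func convert(generic []map[string]any) []typedTool {
	out := make([]typedTool, 0, len(generic))
	for _, g := range generic {
		name, _ := g["name"].(string)
		out = append(out, typedTool{Name: name, Schema: schemaOf(g)})
	}
	return out
}

// anyHaveSchema reports whether at least one tool carries a schema.
func anyHaveSchema(tools []typedTool) bool {
	for _, t := range tools {
		if len(t.Schema) > 0 {
			return true
		}
	}
	return false
}

// allHaveSchema reports whether the list is non-empty and every tool has a schema.
func allHaveSchema(tools []typedTool) bool {
	for _, t := range tools {
		if len(t.Schema) == 0 {
			return false
		}
	}
	return len(tools) > 0
}

// authoritativeHeadline withholds the headline when the baseline carries no
// schemas at all, or when any proxy tool lacks one.
func authoritativeHeadline(baseline, proxy []typedTool) bool {
	return anyHaveSchema(baseline) && allHaveSchema(proxy)
}

func main() {
	obj := map[string]any{"type": "object"}
	baseline := convert([]map[string]any{
		{"name": "new_style", "inputSchema": obj},
		{"name": "legacy", "schema": obj},
	})
	proxy := convert([]map[string]any{{"name": "proxy_tool", "inputSchema": obj}})
	bare := convert([]map[string]any{{"name": "bare"}})

	fmt.Println(authoritativeHeadline(baseline, proxy)) // true
	fmt.Println(authoritativeHeadline(bare, proxy))     // false: schema-less baseline
}
```

Checking the baseline with "any" rather than "all" mirrors the commit's wording: only a systematically schema-less baseline withholds the headline, while upstream tools that legitimately have no schema do not.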
✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).
This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.
Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).
…hema

Addresses CodexReviewer finding on PR #748 / MCP-3167: the live `retrieval` payload emitted flat metric fields, but score-report.schema.json requires nested `retrieval.metrics` + `retrieval.gate`. Restructure RetrievalMetrics into {metrics, gate} so live_report.json validates against the contract, proven by a new jsonschema-validation test (TestRetrievalMetricsConformsToScoreReportSchema).

A standalone live run has no stored baseline, so gate.passed is true by construction (CI regression-gating against a committed baseline is MCP-3133).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
✅ Gatekeeper approval — review verdict: ACCEPT (by KimiReviewer, model-diverse fallback).
This approval is posted automatically by the MCPProxy Gatekeeper App on behalf of KimiReviewer — the model-diverse reviewer-fallback reviewer of record (verdict lives in the Paperclip review thread). Author≠approver satisfied; QA + CI gates enforced separately.
Auto-approved per Model B (MCP-1249) + reviewer-fallback (MCP-3066).
✅ Gatekeeper approval — MCP-42a live benchmark run (full schemas + Recall@k + latency). CodexReviewer ACCEPT + QATester PASS on head 0602786. Live CLI (-live/-proxy/-api-key/-golden) writes live_report.json without changing offline mode; pulls /api/v1/tools schemas + scores /api/v1/index/search with client-measured latency. CI green. Author≠approver.
Summary
Second slice of the MCP-42 benchmark harness, extending `bench/` (PR #747) with a live run against a running proxy. Deterministic and LLM-free.

Adds the three measurements the issue asked for:

- Exact token number: `GET /api/v1/tools` pulls upstream tools with their full JSON input schemas; the proxy-mode tools carry their live schemas via the extended `server.ProxyModeToolDefs` (`BenchProxyToolDef.Schema`, marshaled from the real `tools/list` `InputSchema`). Schemas are counted on both sides, so the headline savings is authoritative. If any proxy tool lacks a schema, the MCP-3161 overstatement guard withholds the headline (`authoritative_headline: false`) and reports raw token totals only.
- Accuracy: replays the Spec 065 retrieval golden set (`retrieval_golden_v1.json`) through the proxy's BM25 search (`GET /api/v1/index/search`) and scores Recall@{1,3,5,10}, MRR, nDCG@10, MAP against graded labels. Field names mirror Spec 065 `score-report.schema.json`.
- Latency: client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot load-all-tools cost. Latency is measured on the client because `SearchToolsResponse.took` is a `"0ms"` stub.

How to run
Run `go run ./bench/cmd/bench -live -proxy URL -api-key KEY`.

`-live` writes `bench/results/live_report.json` (gitignored, CN-003). Default (no `-live`) keeps the offline token run unchanged.

Design notes
`mcp-eval` D1 approach (re-implemented in Go), not its code.

Tests
- `metrics_test.go`: Recall@k/MRR/nDCG/MAP against hand-computed worked examples.
- `live_test.go`: `httptest`-stubbed `/api/v1/tools` + `/api/v1/index/search`; schema-aware token counting.
- `live_report_test.go`: real golden-set load (47 queries), latency percentiles, authoritative-headline path, and the MCP-3161 withhold guard.

`go test ./bench/... -race`, `go vet`, and strict golangci-lint v2 all clean.

Out of scope (follow-ups)
LLM end-to-end task success (pinned model + budget); CI publish-on-tag (Release lane, MCP-3133).
Closes MCP-3132.